Abstract. Reinforcement learning has solid foundations, but becomes inefficient
in partially observed (non-Markovian) environments. Thus, a learning agent,
born with a representation and a policy, might wish to investigate to what
extent the Markov property holds. We propose a learning architecture that
utilizes combinatorial policy optimization to overcome non-Markovity and to
develop efficient behaviors that are easy to inherit, tests the Markov
property of the behavioral states, and corrects against non-Markovity by
running a deterministic factored Finite State Model, which can be learned. We
illustrate the properties of the architecture in the near-deterministic Ms.
Pac-Man game. We analyze the architecture from the point of view of
evolutionary, individual, and social learning.

Keywords: Inherited rules, reinforcement learning, combinatorial explosion, 
look-ahead, Markov property 



1 Introduction 

We are concerned with real world scenarios, where decision making can be arbitrarily 
poor if available sensory information is insufficient or if some older, yet still ongoing or 
implicating events are not taken into account. If past information can not improve deci- 
sion making then state description is Markovian and the problem belongs to the realm 
of Markov Decision Processes (MDPs), which have solid theoretical foundations. The 
only remaining question for MDPs is whether the description fits into the memory and 
whether optimization can be solved in reasonable time or not. The size of the problem 
grows exponentially with the number of variables. Therefore in order to reduce problem 
size, state description has to be compressed by extracting relevant features of the deci- 
sion. As a result, factored reinforcement learning (RL) is obtained, which once again 
raises the question whether compression spared the Markov property or not. 

We are interested in scenarios where the world is close-to-deterministic.
Intriguingly, unless we refer to quantum mechanics, a well-informed Laplace's
demon can make decent close-to-deterministic approximations iSSfl. Surprisingly,
this close-to-deterministic nature may hold, to a great extent, for everyday
human activities, too [22]. With regard to human cooperation, there is a strong
correlation between the ability to read the hidden emotional and cognitive
'parameters' of the partners and collaborative (social) skills [7], indicating
the necessity of Markovian state description in behavior optimization. Last,
but not least, MDP formalism tacitly includes the concept of determinism:
states are perfectly known to the agent within a finite time window. In turn,
the concept of state hides close-to-deterministic finite time processes in MDP.

Note that we read the theorem backwards.
Fig. 1. Learning in a continuously changing world.

(a): Questions for RL in a changing world, (b): Methods to overcome the
non-Markovian character of state description, and (c): Architecture for a
changing world: the agent is born with a relatively small set of low-complexity
rules selected by evolution. Temporal difference (TD) error may uncover the
non-Markovian character of the behavioral states. Decision surfaces based on TD
learning may diminish the non-Markovian character. The look-ahead method is
valuable if a factored and close-to-deterministic finite state model of the
environment can be learned.



Markovian description of the environment is questionable since inherited rules and 
representations generated by observations may be non-Markovian, or the world may 
change and the Markovian property may be lost, or agents may change due to the 
changes of other agents' behavior, and so on. 
In turn, we are interested in learning agents. We restrict our considerations
to environments that can be approximated by close-to-deterministic Finite-State
Machines (FSMs) or can be factored into such FSMs (fFSM). The practical points
are that (i) the so-called factored MDP model 161101 is tractable from the point
of view of optimization [26], and (ii) near-deterministic processes are
relatively easy to learn.

We address the following issues here: Does the problem description possess 
Markov-property? Is it feasible to solve the underlying learning problem? If answers 
are negative then which features are to be learned to improve problem description? 

These questions are illustrated in Fig. 1(a). Figure 1(b) shows certain
opportunities for checking the Markov property without increasing temporal
depth. This occurs by considering the difference of experienced and expected
approximations of long-term cumulated discounted rewards: the temporal
difference (TD) errors. In the figure, we depicted the optimization for
non-Markovian state description by direct global policy search as well as a
method that may overcome traps posed by non-Markovian states via look-ahead
based value estimation. We will not address the delicate issue of pulling out
close-to-deterministic features/factors from high dimensional uncertain
observations: this might be the main bottleneck of artificial general
intelligence (AGI). Instead, we demonstrate that policy optimization may lead
to good performance with non-Markovian behavioral states, and that look-ahead
using close-to-deterministic factors can quickly overcome poor value
estimations of those non-Markovian states and can be used to improve decision
making.

Figure 1(c) represents our main concepts about a learning agent in a changing
world: we assume that the agent is born with a low-complexity rule set that can
save her. This rule set can be developed, e.g., by evolution, and we use
selective global policy optimization. The rule set, however, may lead to
non-Markovian states (the behaviors) if the rules are crude or they do not
match the actual world. The agent starts to learn by
estimating the values of its behavioral states: it collects information about the distribu- 
tion of TD errors of these states. Using the TD errors, the agent may develop new binary 
sensors (decision surfaces) in order to improve state description and decision making. 
Also, the agent may learn close-to-deterministic fFSMs to overcome problems of value 
estimations for non-Markovian states by utilizing look-ahead methods, since (i) look- 
ahead is efficient for close-to-deterministic fFSMs and (ii) look-ahead can easily correct 
poor value estimations. 

The paper is built as follows: Markovity and related RL concepts are reviewed
in Section 2. We executed experimental demonstrations on the Ms. Pac-Man game
in order to illustrate the concepts of Fig. 1(c). Ms. Pac-Man was chosen
because of our previous experience [25]. In this paper, considerations are
restricted to the single Ms. Pac-Man case (Section 3). However, if proper (mind
reading) sensors are available that enable the close-to-deterministic fFSM
characterization of the other agent(s), then our methods may have value for the
multi-agent case [16].

Results of our experiments are discussed in Section 4. We connect our work to
theoretically proven favorable features of factored RL, but leave aside the
problem of learning the model of the environment. We elaborate on the concept
of state in Subsection 4.2. We conclude in Section 5.
2 Reinforcement learning

During the last two decades, reinforcement learning has reached a mature state
and has been laid on solid foundations. We have a large variety of algorithms,
including value-function-based, direct policy search and hybrid methods. For
reviews on these subjects, see, e.g., 1212312411 and references therein. RL
deals with sequential decision tasks, where the decisions of an agent are not
based on information about locally optimal actions; instead, its strategy
relies on a scalar reward associated with each state transition. MDP formalism
encapsulates the RL task as follows:

A Markov Decision Process is a quadruple $\{S, A, P, R\}$, where $S$ is a
finite set of states, $A$ is a finite set of actions, $P : S \times A \times S
\to [0, 1]$ is the state transition probability, giving the probability of
arriving at state $s'$ after executing action $a$ in state $s$, and $R : S
\times A \times S \to \mathbb{R}$ is the reward function: $R(s, a, s')$ is the
immediate reward for transition $(s, a, s')$.

The behavior of an agent is described by its policy, a function $\pi : S
\times A \to [0, 1]$, which assigns a probability to an action in each state.
For a given policy, each state has an associated value, which is the
expectation value of the long-term cumulated discounted immediate reward:

$$V^{\pi}(s) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right],$$

where $0 < \gamma < 1$ is the discount factor, $r_{t+k+1}$ is the immediate
reward in the $(k+1)$-th step, and $E_{\pi}$ denotes the expectation operator
for policy $\pi$. The goal of an RL agent is to find an optimal policy that
chooses the actions giving rise to the largest values in each state.
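As a small illustration of the definition above, the following Python sketch computes the discounted return of a finite reward sequence and averages it over sampled episodes to obtain a Monte Carlo estimate of a state value. The function names and the toy reward sequences are assumptions made for this example; they do not come from the original experiments.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma**k * rewards[k] for one finite episode."""
    total = 0.0
    # Walk backwards so that each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def monte_carlo_value(episodes: Sequence[Sequence[float]], gamma: float) -> float:
    """Average discounted return over episodes that start in the same state."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


# Two hypothetical episodes observed from the same start state.
value_estimate = monte_carlo_value([[1, 1, 1], [0, 0, 4]], gamma=0.5)
```

The estimate only approximates the expectation in the definition; with few episodes it can be far from the true value.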

When the Markov property holds, the optimal policy ($\pi^*$) is the greedy
policy and the corresponding value function is the optimal value function
($V^*$):

$$\pi^*(s) = \arg\max_a \sum_{s'} P(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right].$$

2.1 Solving reinforcement learning problems

Many practical applications have appeared that utilize RL. Algorithms with good
online performance are known and their asymptotic and finite-state behavior is
well understood in most cases.

Dynamic programming When $S$, $A$, $P$, $R$ are available, an iterative method
composed of improvement and greedification steps can be used to obtain an
optimal policy. This approach may fall short when $S$ is large or the model is
not available.

Temporal differences Temporal Differences (TD) and its simplest form, TD(0),
can be used to build an appropriate value function if the matrices $P$ and $R$
are not known or are unmanageably large.
Regarding time instants $t = 0, 1, 2, \ldots$, assume that in state $s_{t+1}$
reward $r_{t+1}$ is observed, while value estimation at time $t$ gives rise to
$V_t(s_t)$ and $V_t(s_{t+1})$ for states $s_t$ and $s_{t+1}$, respectively.
Then the TD(0) error of the estimation is

$$\delta_{t+1} = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t).$$

According to TD(0) learning, value estimation is updated in the following way:

$$V_{t+1}(s_t) = V_t(s_t) + \alpha\, \delta_{t+1},$$

where $\alpha$ is the learning factor.

Many real problems resist RL, e.g., if the state description is non-Markovian
or if the number of variables that describe the state is large, since the size
of the state space grows exponentially with the number of variables. Solutions
to such problems are considered below.

Policy search, CE method The lack of the Markovian property can be circumvented
to some extent by global policy search methods that do not rely on a value
function; instead, they maintain an explicit parametric representation of
policies and update it according to a measure of the rewards acquired during
one or more trial episodes. It has been shown that the theoretically firm
Cross-entropy Method (CEM) [21] is efficient in policy search [31].
Furthermore, CEM can deal with non-Markovian state descriptions and the
optimization of multiple actions (action combinations) acting simultaneously,
enabling the optimization of factored action representations [25].

CEM works as follows: Let $x = (x_1, \ldots, x_n) \in X$ be a discrete vector
and $S : X \to \mathbb{R}$ be a so-called goal or fitness function, which
defines the fitness of a particular solution vector $x$. The goal is to find
the vector with the highest corresponding fitness value:

$$x^* = \arg\max_{x \in X} S(x).$$

CEM maintains a distribution from a family of parametric distributions ($G$)
over the set of possible solutions $X$ and converges towards a distribution in
this family whose density is the highest in an arbitrarily small neighborhood
of the optimal solution $x^*$. The algorithm is iterative, where $t = 0, 1,
\ldots, T$ marks subsequent iterations.

In each iteration, let $N$ denote the number of independent sample points
($x^{(i)} \in X$) drawn from the current distribution $g_t$. For an arbitrary
$\theta \in \mathbb{R}$, the set of high-valued sample points is

$$\hat{L}_\theta := \{ x^{(i)} \mid S(x^{(i)}) \ge \theta,\ 1 \le i \le N \},$$

which converges to the level set belonging to $\theta$, i.e., to $L_\theta :=
\{ x \mid S(x) \ge \theta \}$.

For high $\theta$ values, a uniform distribution over the level set $L_\theta$
is concentrated around the global optimum $x^*$. During the optimization,
$\theta$ has to be adjusted adaptively because at the beginning it is highly
likely that the sample set $\hat{L}_\theta$ is empty for high $\theta$ values.
CEM specifies a ratio $\rho \in [0, 1]$, which describes the number of items in
the level set and therefore determines the actual $\theta$ value. The best $N
\cdot \rho$ items are called elite samples and they typically comprise 2-10% of
the samples. If the samples are ordered by descending fitness value, then the
value of $\theta$ can be calculated from $\rho$ as

$$\theta := S(x^{(\rho \cdot N)}).$$

After sorting the samples in this way, the distribution $g_t$ can be updated so
that the Kullback-Leibler divergence between a uniform distribution over the
approximated level set $\hat{L}_\theta$ and the distribution decreases. For
more details, see llTSI .

CEM is a selective batch method and it resembles genetic algorithms in many
respects. CEM can be extended to online optimization [28], which may suit real
world RL problems better.
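The CEM loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the parametric family is taken to be a product of independent Bernoulli distributions over binary vectors, and the smoothing factor `step`, the default sample sizes and the toy fitness function (the number of ones) are all choices made for this example.

```python
import random


def cross_entropy_method(fitness, n_bits, n_samples=100, rho=0.1,
                         n_iters=30, step=0.7, seed=0):
    """Maximize `fitness` over binary vectors of length `n_bits`."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits                    # Bernoulli parameters of g_t
    n_elite = max(1, int(rho * n_samples))    # number of elite samples, N * rho
    best, best_fit = None, float("-inf")
    for _ in range(n_iters):
        # Draw N independent samples from the current distribution.
        samples = [[1 if rng.random() < p else 0 for p in probs]
                   for _ in range(n_samples)]
        # Sort by descending fitness; the fitness of the last elite sample
        # plays the role of the threshold theta.
        samples.sort(key=fitness, reverse=True)
        elite = samples[:n_elite]
        if fitness(elite[0]) > best_fit:
            best, best_fit = elite[0], fitness(elite[0])
        # For a Bernoulli family, fitting the elite set amounts to taking
        # per-bit elite frequencies; the update is smoothed by `step`
        # (an assumption of this sketch).
        for j in range(n_bits):
            freq = sum(x[j] for x in elite) / n_elite
            probs[j] = (1 - step) * probs[j] + step * freq
    return best, best_fit, probs


# Toy usage: the fitness of a vector is the number of ones it contains.
best, best_fit, probs = cross_entropy_method(sum, n_bits=10)
```

In policy search, each bit could stand for the presence of a rule in the policy and the fitness for the score collected over a few episodes; that mapping is only indicated here, not implemented.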

The exponential blow-up of the state space can be treated with function approxima- 
tions and factored-state MDPs that are tied together: 

Function approximation When the state- or action-space is large or continuous,
function approximation methods (FAPPs) can also be used. They create value
functions by approximating sample points with supervised learning methods,
which are now controlled by RL. In the case of FAPPs, the goal is to create an
explicit representation of approximated value functions, and parameters are
typically updated after each consecutive step. For convergent general
algorithms see [20] and the references therein. Special FAPPs that have
temporal dynamics can also be used to diminish the consequences of
non-Markovian observations [30].

Factored-state Markov Decision Processes We assume that $S$ is the Cartesian
product of $m$ smaller state spaces (corresponding to individual variables):

$$S = S_1 \times S_2 \times \ldots \times S_m,$$

where $S_i$ has the size $|S_i| \ll |S|$ ($i = 1, \ldots, m$). This is the case
of factored Markov Decision Process (fMDP, or factored RL) 01415161 . A naive,
tabular representation of the transition probabilities would require
exponentially large space (that is, exponential in the number of variables
$m$). However, the next-step value of a state variable often depends only on a
few other variables, so the full transition probability can be obtained as the
product of several simpler factors. It has been shown that by uniformly
sampling polynomially many samples from the (exponentially large) state space,
the complexity of the fMDP solver stays polynomial in the size of the fMDP
description length, i.e. ill 1261 . This convergent fMDP solver makes use of
function approximations and ties fMDPs to FAPPs.
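To make the product form concrete, here is a minimal Python sketch in which the transition probability of a two-variable state is the product of two local factors, each depending on a single parent variable. The variable meanings ("dot present", "ghost near"), the action name and the probabilities are hypothetical and serve only as an illustration.

```python
def factored_transition_prob(state, action, next_state, factors):
    """P(next_state | state, action) as a product of per-variable factors.

    `factors[i]` is a pair (parents, local), where `parents` lists the indices
    of the state variables that variable i depends on, and
    `local(parent_values, action, next_value)` returns a probability.
    """
    prob = 1.0
    for i, (parents, local) in enumerate(factors):
        parent_values = tuple(state[j] for j in parents)
        prob *= local(parent_values, action, next_state[i])
    return prob


def dot_factor(parent_values, action, next_value):
    # Deterministic factor: the dot disappears when it is eaten.
    present = parent_values[0]
    expected = 0 if action == "eat" else present
    return 1.0 if next_value == expected else 0.0


def ghost_factor(parent_values, action, next_value):
    # Slightly stochastic factor: "ghost near" persists with probability 0.8.
    near = parent_values[0]
    return 0.8 if next_value == near else 0.2


factors = [((0,), dot_factor), ((1,), ghost_factor)]
p = factored_transition_prob((1, 0), "eat", (0, 0), factors)
```

Each factor needs a table whose size depends only on its parents, so the description grows with the number of factors rather than with the size of the full state space.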

Factored MDPs may also take advantage of the module concept of RL, see below. 

RL modules The concept of time step is very flexible in MDPs. A particular
solution defines time intervals by actions: the state is described by the
available modules, i.e., the complex actions that can be used at a given time.
Then the state changes only when the set of available modules changes and the
new set defines the new state, whereas the activated module of the set defines
the action in that state [12]. This formulation is advantageous for two
reasons. For one, the formulation can easily be extended by robust controllers
[32]. In addition, the formulation enables activating more than one module - a
macro - at a time [25], giving rise to combinatorial flexibility, factored
descriptions, and behavioral states upon optimization, as we shall see later.

Model construction and planning in RL Learning of the transition probability matrix, 
the dynamical model of RL, offers several advantages. First of all, this model enables 
planning. It is useful if value estimation is imprecise, for example if rewards
may disappear, or new rewards may appear. In this case, one can run the learned
model, like in
dynamic programming. Typical usage of the model computes the expected value of a 
state by summing up discounted future rewards according to the dynamical model and 
the policy. Advantages appear if real world experimentation is limited, and some of the 
rewards, certain transition probabilities, or a few of the factors may change, but can 
be learned easily. Upon observation of such changes, the dynamical model can update 
value estimation without further real world experiments. 

The other advantage of model learning is that the exploration-exploitation
dilemma can be resolved under certain conditions, both for RL [i2? | and for
factored RL [29].

Factored Finite State Model approximation for planning Planning in the Ms.
Pac-Man game (see later, [25]) is severely restricted by the size of the state
space. On the other hand, the state transition can be easily described with
factors. For example, if Ms. Pac-Man walks over a dot then it disappears.
Learning of this 'machine' is very simple. Walls and corridors of the game are
also simple 'machines' since they do not change. There are other, somewhat more
complex machines in the game; they have a bit of probabilistic behavior. All of
these components are simple to learn and thus an approximate sampling based
ifTsl or a risk avoiding lH] (close-to-)deterministic (factored) model can be
run by Ms. Pac-Man at a low cost and with low memory demand for the sake of
look-ahead (planning).

Note that look-ahead is typical in games, including Ms. Pac-Man (see e.g.
[33]) and it is also considered in the MDP literature (see, e.g., [9] and the
references therein). Here, we consider look-ahead from the point of view of a
general RL architecture that inherits knowledge, learns, and faces new
challenges. There are many methods to solve the FSM identification task H,
provided that the close-to-deterministic features are available.
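A depth-limited look-ahead over a deterministic factored model can be sketched as follows. The toy world is an assumption made for this example: a one-dimensional corridor in which a dot disappears when the agent walks over it and walls clamp the position. The function names and the zero leaf value are likewise hypothetical.

```python
def step(state, action, length):
    """Deterministic model: move by -1 or +1, eat the dot at the new position."""
    pos, dots = state
    new_pos = min(max(pos + action, 0), length - 1)  # walls clamp the move
    reward = 10 if new_pos in dots else 0
    return (new_pos, dots - {new_pos}), reward


def lookahead(state, depth, length, leaf_value=lambda s: 0.0, gamma=1.0):
    """Return (best value, best first action) of a depth-limited search.

    `leaf_value` stands for a possibly poor value estimate that is used only
    at the search horizon.
    """
    if depth == 0:
        return leaf_value(state), None
    best_val, best_act = float("-inf"), None
    for action in (-1, +1):
        nxt, reward = step(state, action, length)
        val, _ = lookahead(nxt, depth - 1, length, leaf_value, gamma)
        total = reward + gamma * val
        if total > best_val:
            best_val, best_act = total, action
    return best_val, best_act


# Agent at position 2 in a corridor of length 6, dots at positions 0, 4 and 5.
start = (2, frozenset({0, 4, 5}))
value, action = lookahead(start, depth=3, length=6)
```

With depth 3 the search finds that moving right collects two dots, a preference that a one-step estimate with a zero leaf value cannot express.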

3 Architecture through an illustrative example 

In this section we illustrate the logic of the algorithmic components as shown
in Fig. 1(c). We assume a reinforcement learning scenario, where an MDP
possesses a large action- and/or state-space but no method is known to
efficiently compress the information via feature extraction while preserving
the Markov property of the description. We use the Ms. Pac-Man game for the
illustration.

3.1 Pac-Man and reinforcement learning 

The video-game Pac-Man was first released in 1979, and it is considered to be
one of the most popular video games to date [34].



Fig. 2. A snapshot of the Pac-Man game

The player maneuvers Pac-Man in a maze (see Fig. 2), while Pac-Man eats the dots
in the maze. In this particular maze there are 174 dots, each one is worth 10 points. A 
level is finished when all the dots are eaten. There are also four ghosts in the maze who 
try to catch Pac-Man, and if they succeed, Pac-Man loses a life. Initially, he has three 
lives, and gets an extra life after reaching 10,000 points. 

There are four power-up items in the corners of the maze, called power dots (worth 
40 points). After Pac-Man eats a power dot, the ghosts turn blue for a short period (15 
seconds), they slow down and try to escape from Pac-Man. During this time, Pac-Man 
is able to eat them, which is worth 200, 400, 800 and 1600 points, consecutively. The 
point values are reset to 200 each time another power dot is eaten, so the player would 
want to eat all four ghosts per power dot. If a ghost is eaten, his remains hurry back to 
the center of the maze where the ghost is reborn. At certain intervals, a fruit appears 
near the center of the maze and remains there for a while. Eating this fruit is worth 100 
points. 

We restricted our studies to the first level, so the maximum achievable score
is 174 · 10 + 4 · 40 + 4 · (200 + 400 + 800 + 1600) = 13,900, plus 100 points
for each time a fruit is eaten.
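The arithmetic behind this bound can be checked with a short Python snippet. The constant names are chosen for this example; the numbers are the ones given in the game description above.

```python
DOTS, DOT_POINTS = 174, 10
POWER_DOTS, POWER_DOT_POINTS = 4, 40
GHOST_POINTS = (200, 400, 800, 1600)  # four ghosts eaten after one power dot
FRUIT_POINTS = 100


def max_first_level_score(fruits_eaten=0):
    """Upper bound on the first-level score for a given number of fruits."""
    dots = DOTS * DOT_POINTS                 # 1740
    power = POWER_DOTS * POWER_DOT_POINTS    # 160
    ghosts = POWER_DOTS * sum(GHOST_POINTS)  # 4 * 3000 = 12000
    return dots + power + ghosts + fruits_eaten * FRUIT_POINTS


max_score = max_first_level_score()
```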

In the original version of Pac-Man, ghosts move on a complex but deterministic 
route, so it is possible to learn a deterministic action sequence that does not require any 
observations. In Ms. Pac-Man, randomness was added to the movement of the ghosts. 
This way, there is no single optimal action sequence, and observations are
necessary for optimal decision making. We used the implementation available on
the Internet [19].

Ms. Pac-Man as an RL task Ms. Pac-Man meets all the criteria of a reinforcement 
learning task. The agent has to make a sequence of decisions that depend on its obser- 
vations. The environment is stochastic (because the paths of ghosts are unpredictable). 
There is also a well-defined reward function (the score for eating things), and
actions influence the rewards to be collected in the future.

The full description of the state would include (1) whether the dots have been eaten 
(one bit for each dot and one for each power dot), (2) the position and direction of 
Ms. Pac-Man, (3) the position and direction of the four ghosts, (4) whether the ghosts 
are blue (one bit for each ghost), and if so, for how long they remain blue (in
the range of 1 to 15 seconds), (5) whether the fruit is present, and the time
left until it appears/disappears, and (6) the number of lives left. The size of
the resulting state space is
huge, so some kind of function approximation or feature-extraction is necessary for RL. 

The action space is much smaller, as there are only four basic actions: go 
north/south/east/west. However, a typical game consists of multiple hundreds of steps, 
so the number of possible combinations is still enormous. This indicates the need for 
temporally extended actions. 

We provide mid-level domain knowledge to the algorithm; we use domain knowl- 
edge to preprocess the state information and to define action modules. On the other 
hand, it will be the role of the policy search reinforcement learning to combine the 
observations and modules into rule-based policies and find their proper combination. 

Ms. Pac-Man is attractive for the illustration, because the environment, like
many other games, has some life-like characteristics, and the agent has to
solve interrelated tasks and identify interconnected behaviors in a changing
environment. These interrelated tasks are not separable and have mutual effects
on each other.

The purpose of our demonstration is to assemble a common sense RL architecture 
by utilizing certain RL techniques. We shall argue later that feature extraction is the 
main bottleneck for this architecture and shall consider what kind of features are ad- 
vantageous for RL. 

3.2 Development of rule-based policies and learning macros 

A rule-based policy is a set of rules with some mechanism for breaking ties, i.e., to de- 
cide which rule is executed, if there are multiple rules with satisfied conditions. Actions 
can last for a while and some of them may be applied in combination. We
reproduced the actions of [25]; some examples are:

(i) ToDot: Go towards the nearest dot, 

(ii) ToPowerDot: Go towards the nearest power dot, 

(iii) FromPowerDot: Go in direction opposite to the nearest power dot, 

(iv) ToEdGhost: Go towards the nearest edible ghost, 

(v) FromGhost: Go in direction opposite to the nearest ghost, 

among others. Here are some examples of observations used for rule
construction. For the full list, see the original publication [25]:

(i) NearestDot: Distance of nearest dot, 

(ii) NearestPowerDot: Distance of nearest power dot, 

(iii) NearestGhost: Distance of nearest ghost, 

(iv) NearestEdGhost: Distance of nearest edible (blue) ghost. 
Observations can be extended to conditions. For the sake of simplicity,
conditions were restricted to have the form [observation] < [value],
[observation] > [value], [action]+, [action]-, or the conjunction of such
terms. For example,

(NearestDot<5) and (NearestGhost>8) and (FromGhost+)

is a valid condition for our rules.

Once we have conditions and actions, rules can be constructed easily. In our
implementation, a rule has the form "if [Condition], then [Action]." For
example,

if (NearestDot<5) and (NearestGhost<5) then FromGhost+

makes a potential rule. Each rule has a priority assigned. When the agent has to make 
a decision, she checks her rule list starting with the ones with highest priority. If the 
conditions of a rule are fulfilled, then the corresponding action is executed, and the 
decision-making process halts. 

If a rule with priority k switches on an action module, then the priority of the action 
module is also taken as k. Intuitively, this makes sense: if an important rule is activated, 
then its effect should also be important. If a rule with priority k switches off a module, 
then it is executed, regardless of the priority of the module. The procedure is
depicted in Fig. 3. Rules can be pre-wired, or can be created randomly, and a
global search method can select the good ones and assign their priorities [25].
Thus, learning can be concerned
with (i) combining different conditions and actions to form rules and (ii) deciding about 
the priorities of the rules. This policy optimization is accomplished by the cross-entropy 
method. Macros are simultaneously used rule combinations. 
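The decision step with prioritized rules can be sketched in Python as below. The data layout is an assumption of this example: a rule holds a priority, a condition over observations and module states, a module name and an on/off switch. Following the labels of Fig. 3, a smaller number is treated as a higher priority, which is also an assumption. The bookkeeping of module priorities described above is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Rule:
    priority: int                                   # smaller is checked first
    condition: Callable[[Dict[str, float], Dict[str, bool]], bool]
    module: str                                     # action module to switch
    switch_on: bool                                 # True for '+', False for '-'


def decide(rules: List[Rule], obs: Dict[str, float],
           modules: Dict[str, bool]) -> Optional[Rule]:
    """Fire the first rule (in priority order) whose condition holds."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.condition(obs, modules):
            modules[rule.module] = rule.switch_on
            return rule  # decision making halts after the first match
    return None


rules = [
    # if (NearestDot<5) and (NearestGhost<5) then FromGhost+
    Rule(1, lambda o, m: o["NearestDot"] < 5 and o["NearestGhost"] < 5,
         "FromGhost", True),
    # if NearestEdGhost>99 then ToPowerDot+
    Rule(2, lambda o, m: o["NearestEdGhost"] > 99, "ToPowerDot", True),
]

modules = {"FromGhost": False, "ToPowerDot": False}
fired = decide(rules, {"NearestDot": 3, "NearestGhost": 4,
                       "NearestEdGhost": 100}, modules)
```

Because `sorted` is stable, rules with equal priority are checked in list order, which provides a simple tie-breaking mechanism.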

The optimization procedure gives rise to a 'low-complexity policy': there can be 
many rules in the policy but only a few rules can be concurrently active at a time. The 
effective length of policies is biased towards short policies. Frequent, low-complexity 
rule combinations can be used as building blocks in a search for more powerful
but still low-complexity policies. The result, i.e., the optimized Ms. Pac-Man
policy, may be seen as a new action that can be selected under certain
conditions. For more details on the rules, see [25].

3.3 Score, macro values and macro sequences 

Score Global policy optimization gave rise to good performance: we used only
the first level and the average score was 6680 and the average life was 2.82,
out of the 3 possible lives. The maximum achievable score is 174 · 10 + 4 · 40
+ 4 · (200 + 400 + 800 + 1600) = 13,900. Ms. Pac-Man was not able to eat all
ghosts after eating a power dot. From the score it follows that she could eat
about 2 to 3 ghosts after consuming the power dot.

Behavioral states Macros correspond to a group of simultaneously active rules
and can be interpreted as behavioral states or behavioral patterns with the
action equivalent to the application of the actions of the macro. A very small
set of macros emerged in the
[Figure 3: diagram showing the observations (NearestDot, NearestGhost,
NearestEdGhost, ..., GhostDensity), the action-module states (ToDot+,
FromGhost-, ToEdGhost-, ...), the priority list below, and the transition to
time step t+1.]

[1] if (NearestGhost<4) and (ToLowerGhostDensity-) then FromGhost+
[1] if (NearestGhost>7) and (JunctionSafety>4) then FromGhost-
[2] if (NearestEdGhost<99) then ToEdGhost+
[2] if (NearestEdGhost>99) then ToEdGhost-

Fig. 3. Decision-making for the Ms. Pac-Man agent. At time step t, the agent
receives the actual observations and the state of her action modules. She
checks the rules of her priority list in order, and executes the first rule
with satisfied conditions. Plus sign means that an action module is enabled.


No.  Code    Score  V(s)  Description
1.   010011  5133   6621  FromGhost- & FromPowDot
2.   010101  6812*  7024  FromGhost- & ToEdGhosts
3.   011001  4006   6606  FromGhost- & ToPowDot
4.   100011  6732*  6142  FromGhost+ & FromPowDot
5.   100101  6024   6455  FromGhost+ & ToEdGhosts
6.   101001  1770   6100  FromGhost+ & ToPowDot

Table 1. Behavioral states, or macros, their codes and their values.

1st column: number of the behavioral state. 2nd column: individual digits
denote if an action is 'on' or 'off'. Meaning of the digits from left to right:
FromGhost+, FromGhost-, ToPowDot, ToEdGhosts, FromPowDot, and if Constant>0
then ToNearestPill+. The last digit of the code applies under all conditions.
3rd column: Ms. Pac-Man scores after CEM optimization of actions during the
time when the behavioral state is effective. Averages for 5000 runs. Scores
marked with * (bold italic in the original): optimization of the actions of the
macro increased the score above 6680 (the average score of inherited behavior)
and policy greedification is possible. 4th column: estimated values of
inherited behavioral states V(s) upon optimization of the actions. Averages for
5000 runs. 5th column: description of the behavioral states.

Ms. Pac-Man game, there were only 6 macros. They are listed in Table 1. The
values of the behavioral states can be estimated, e.g., by the TD method.
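A minimal tabular sketch of such a TD(0) estimate over behavioral states is given below. It also stores the TD errors per state, so that their distribution can later be inspected. The state labels, rewards and parameter values are hypothetical, and the function is an illustration rather than the code used for Table 1.

```python
from collections import defaultdict


def td0_update(values, errors, s, r_next, s_next, alpha=0.1, gamma=0.95):
    """One TD(0) step for the transition s -> s_next with reward r_next."""
    delta = r_next + gamma * values[s_next] - values[s]  # TD(0) error
    values[s] += alpha * delta                           # value update
    errors[s].append(delta)                              # keep for histograms
    return delta


values = defaultdict(float)  # value estimates, initialized to zero
errors = defaultdict(list)   # TD errors collected per behavioral state

# Two hypothetical transitions between behavioral states "A" and "B".
d1 = td0_update(values, errors, "A", 10.0, "B", alpha=0.5, gamma=0.9)
d2 = td0_update(values, errors, "B", 0.0, "A", alpha=0.5, gamma=0.9)
```

A wide or multi-modal distribution in `errors[s]` that does not shrink with further learning would hint that state `s` lumps together situations with different futures, i.e., that the description is non-Markovian.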
Since Table 1 is very small, it is easy to pre-wire these rules. Since CEM is a
probability-based, selective, global optimization method, we consider it as the
model of
evolution here. These pre-wired rules and their respective priorities give good chances 
for the agent to survive for a while and to collect additional information to improve 
performance via individual learning. We note that thought experiments may also apply 
CEM, but CEM may be too dangerous for experimentation. 

Some macros often occur consecutively and could be joined into macro sequences. 
The formation of macro sequences means a higher level of abstraction, which basically 
incarnates a new behavior. An example of this phenomenon is the observed zig-zagging
behavioral pattern, when the Ms. Pac-Man agent performs turns towards and from the 
power dot in each step: out of the many potential choices, CEM selected a rule set at 
the second priority level: 

P2: if GhostDensity<1.5 and NearestPowerDot<5 then FromPowerDot+
P2: if NearestEdGhost>99 then ToPowerDot+

That is, if ghost density is low and the power pill is close then move away
from the power pill. If this rule is on, then the other rule is considered: if
there is no edible ghost then move towards the power pill. These two rules at
priority level 2 achieve good performance since they give rise to a zig-zagging
behavior close to the power pill while ghost density is small. It looks as if
Ms. Pac-Man were waiting for the ghosts and eating the power dot only when they
approach, in order to have a better chance to catch them afterwards. This is an
emerging behavior, since there is no rule for 'waiting'. It is unrelated to
foreseeing that ghosts may come closer in the future; it is simply the result
of selection under global policy optimization. Zig-zagging may also give rise
to a trap, as shown in Fig. 4(a).

There are proper conditions for the zig-zagging behavior, so one could turn zig- 
zagging ON and OFF and zig-zagging could be a new macro. The repetitive behavioral 
pattern can be easily identified during execution and it is easy to compress it into a new 
module that may undergo CEM optimization (but see Lamarckism). In the following, 
we consider individual learning during a small number of episodes, i.e., during the 
game. 

3.4 Individual Learning 

Policies and macros can be optimized further e.g. by: 

1. considering each macro state as a sub-problem as long as the macro is 'ON' and 
applying CEM during this time interval, 

2. computing the values of the states, inspecting the TD error histogram and then 
building up predictors and decision surfaces for large TD errors and thus increasing 
the 'state space'. Then, using the new states and using the rule set, a new CEM 
optimization may improve performance. 

3. Alternatively, we may apply a look-ahead procedure.

Fig. 4. Lack of look-ahead.

(a) [left] and [middle]: Clyde chooses a random move and moves to the right instead of chasing
Ms. Pac-Man. [right]: Clyde then switches behavior, approaches Ms. Pac-Man from the troubled
direction, crosses the power dot, and can then catch Ms. Pac-Man.

(b) [left]: Ms. Pac-Man could escape, [middle]: she chases the edible ghost, and [right]: falls into
the trap closed by Pinky.



Note that the first option develops a new behavior for each state without using additional
information. This method can help since it extends the range of priorities by introducing
a new priority set within each 'state'. The other two options take advantage of additional
information collected during the game. We consider the three options below.



Subtask optimization. The optimization of individual macro states via CEM may give
rise to novel behaviors that collect more rewards in each state. We executed this program
and could increase the collected rewards in each state of the Ms. Pac-Man game. Then,
if the Markov condition is satisfied, greedy replacement of the old behavior by the new one
should improve performance. However, there is no guarantee of improvement if the Markov
condition is not satisfied. The third column of Table 1 shows average performance after
greedy behavior optimization in each state. Since the average performance without this
optimization equals 6680 points, performance improves only by changing behaviors in state 2
and state 4, and not in the other states. For example, optimizing behavior for state 6, we get
very poor performance (1770 points on average) upon greedy policy optimization.

Subtask optimization does not suit individual learning well, since it applies CEM 
and is thus slow. It is also dangerous since overall performance may easily drop even if 
optimization of behavior is successful. 
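
For orientation, a minimal sketch of the cross-entropy method over binary decision vectors is given below. It is a generic textbook-style CEM, assuming that a candidate rule set is encoded as a bit vector (rule included or not) and that `score` returns the collected reward. The encoding, the name `cem_bernoulli`, the toy score, and all parameter values are illustrative assumptions, not the optimizer that was actually used.

```python
import random
from typing import Callable, List, Tuple


def cem_bernoulli(
    score: Callable[[List[int]], float],
    n_bits: int,
    pop_size: int = 50,
    elite_frac: float = 0.2,
    alpha: float = 0.7,
    iterations: int = 30,
    seed: int = 0,
) -> Tuple[List[int], List[float]]:
    """Cross-entropy method with independent Bernoulli parameters per bit."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits                      # start from an uninformative distribution
    n_elite = max(1, int(pop_size * elite_frac))
    best, best_score = None, float("-inf")
    for _ in range(iterations):
        # Sample a population and evaluate every candidate once.
        population = [[1 if rng.random() < p else 0 for p in probs] for _ in range(pop_size)]
        scored = sorted(((score(ind), ind) for ind in population),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best = scored[0]
        # Move the distribution towards the elite samples (smoothed update).
        elite = [ind for _, ind in scored[:n_elite]]
        for i in range(n_bits):
            freq = sum(ind[i] for ind in elite) / n_elite
            probs[i] = (1 - alpha) * probs[i] + alpha * freq
    return best, probs


# Toy stand-in for "play the game and return the score": count matches with a target.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]


def toy_score(bits: List[int]) -> float:
    return float(sum(b == t for b, t in zip(bits, TARGET)))


best, probs = cem_bernoulli(toy_score, n_bits=len(TARGET))
print(best, toy_score(best))
```

Every candidate in every iteration needs its own evaluation (here 50 x 30 calls of `score`), which makes the cost of such a search within a small number of episodes apparent.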

Decision surface using TD errors. We have estimated the values of the behavioral
states (Table 1, fourth column) and computed the histogram of TD errors (Fig. 5). Large
TD errors occur (Fig. 5(a)), which we made visible by excluding the most frequent TD
error, the one that determines behavioral success (Fig. 5(b)). Consider state 2, for example. The
average reward collected in this state is about 330. It is accumulated from dots and
ghosts. The largest TD peak is at around -150, i.e., in this case Ms. Pac-Man collected
about a single ghost and very few dots. The second peak is at around 600 and it is
broader. This corresponds to eating two ghosts and about 30 dots. The third and the
fourth peaks are at around 1,400 and 2,800, meaning that Ms. Pac-Man could collect
three or four ghosts, respectively.
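
A minimal sketch of how such a per-state TD error histogram can be computed is given below. It assumes that each visit to a macro state is logged as a triple (state, reward collected, next state) and that the state values are already estimated. The function names, the bin width, and the sample transitions are hypothetical; the numbers are made up to resemble the peaks described above and are not experimental data.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Tuple

Transition = Tuple[Hashable, float, Optional[Hashable]]  # (state, reward, next state or None)


def td_errors(transitions: List[Transition],
              values: Dict[Hashable, float],
              gamma: float = 1.0) -> Dict[Hashable, List[float]]:
    """TD error r + gamma * V(s') - V(s), grouped by the state s that was left."""
    errors: Dict[Hashable, List[float]] = defaultdict(list)
    for s, r, s_next in transitions:
        v_next = 0.0 if s_next is None else values[s_next]  # None marks the end of the game
        errors[s].append(r + gamma * v_next - values[s])
    return dict(errors)


def histogram(samples: List[float], bin_width: float) -> Dict[float, int]:
    """Count samples per bin; each key is the lower edge of a bin."""
    counts: Dict[float, int] = defaultdict(int)
    for x in samples:
        counts[math.floor(x / bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Hypothetical values and visits to state 2 (illustrative numbers only).
values = {2: 330.0, 5: 0.0}
visits = [(2, 180.0, 5), (2, 180.0, 5), (2, 930.0, None)]

errors = td_errors(visits, values)
print(errors[2])
print(histogram(errors[2], bin_width=100))
```

A histogram with several well separated peaks for one state, as in this toy case, is the kind of signal that suggests the state lumps together situations with different outcomes.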

It would be ideal for Ms. Pac-Man to avoid the TD errors -150, 600, and 1,400 and to
collect all four ghosts. One may build a classifier for this purpose by collecting situations
in which all four ghosts are captured and all other situations. The classifier then
forecasts the reward and defines an RL problem, which either wins (reaches the goal)
or loses. CEM can be applied to the optimization of actions starting from a number of
situations collected during the game. This procedure is a special form of goal-related
feature extraction. It can be solved by RL methods; we did not execute it. The situation
corresponds to categorical perception (see, [jl I'l and references therein), which is
to be favored if factored description is not available. We will come back to this point
later. We will argue that searching for close-to-deterministic factors and applying factored
FSM approximation together with look-ahead is more flexible.

3.5 Look-ahead module 

In the previous section, we argued that decision surfaces should help decision making. 
However, there are potential problems with decision surfaces even in the Ms. Pac-Man 
game that we list here. 

(b) Histogram of TD errors without the main peak

Fig. 5. TD error distribution. Distributions are uneven, indicating that decision making can be
improved.

World may change: A considerable part (if not all) of the knowledge that we encode
into the decision surface may be lost if the world changes. There are many variables
of the world and changes to any of them may spoil decisions. We note that
human recognition utilizes both component based recognition, when the decision about
a category involves the presence or absence of the components [3], and categorical
perception, or example based learning, when sensory observation is changed
around the decision surfaces (see e.g., f[T\ ). This latter version of learning appears
only if component based recognition is not useful. Component based recognition
seems easier. As an example, consider face recognition. The hierarchical structure
of the components, i.e., the components of the face, does not change when the environment
changes, e.g., if the light condition changes. Furthermore, component based
recognition also works under occlusion.
Learning is hard: It is hard to learn decision surfaces in real life scenarios.
Even in the artificial example of Ms. Pac-Man, a slight change of the corridors of
the labyrinth or of the actual number, position, and type of the ghosts may make a big
difference. Such sensitivity to small changes of individual variables or combinations
of the variables may be compensated by an enormous number of training
examples, but that becomes prohibitively expensive. Whereas component based
recognition suits high dimensional problems and is typically easy to learn, categorical
perception is hard to learn. Categorical perception appears mostly in problems
where the inherent dimension is low. Examples include decisions about female
or male faces, borders between colors, and so on.

Taken together, example based learning of decision surfaces in the Ms. Pac-Man problem
is fragile and takes a long time.

It is, however, easy to learn the behavior of the components. For example, any brick
of the wall is a component in the Ms. Pac-Man game. Bricks have coordinates and they
do not move. Dots are different: they also have coordinates, but they disappear when
Ms. Pac-Man crosses them. Power dots are similar, but they also change the behavior
of the ghosts. Ghosts are either escaping or they are chasing Ms. Pac-Man. State transitions
are simple, but chasing behavior has some randomness. Fruits (which we did not
use) appear randomly, but they have close-to-deterministic trajectories: upon observation,
one could decide whether the fruit trajectory can be intercepted or not.

Thus, the sophistication of the Ms. Pac-Man game comes from the configurational diversity,
whereas the behavior of the components is very simple, sometimes deterministic,
and easy to learn "on the fly". Learning the dynamics of the components
enables factored description and produces a factored world model with low branching
ratio that can produce potential configurations. Changes in the world, like a different
labyrinth or a different ghost speed, are again easy to observe and learn. Finally, since
the branching ratio is low, such models are easy to run and can be used for look-ahead
(LA). LA is commonly used in RL, including games (see, e.g., 1131113311 ).

Here, we implemented a non-perfect fFSM model of the game (see the footnote on fFSM
learning). Look-ahead used this game model and was implemented as a set of trajectories, where each
trajectory contained a future state and the rewards collected up to that future state.
Trajectories were enumerated using a depth-first search strategy applied on the state
graph of the original model. If the look-ahead model approximates the world closely, then
it can improve the policy starting at (any) state s_0:



with discount factor γ = 1. That is, the estimated value may make only a relatively small
contribution to the value estimation, and that estimated value concerns future states closer
to the end of the game, where estimation errors are smaller.
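
One standard way to write a finite-depth look-ahead estimate that agrees with this description is sketched below. The notation is assumed rather than taken from the text: n is the look-ahead depth, τ ranges over the model trajectories that start at s_0, r_t is the reward the model predicts at step t, and \hat{V} is the inherited value estimate of the state reached at the end of the trajectory.

```latex
\hat{V}_{\mathrm{LA}}(s_0) \;=\; \max_{\tau = (s_0, s_1, \dots, s_n)}
  \left[ \sum_{t=0}^{n-1} \gamma^{t}\, r_t \;+\; \gamma^{n}\, \hat{V}(s_n) \right],
  \qquad \gamma = 1 .
```

Under this form, the longer the trajectory, the larger the share of the estimate that comes from rewards computed by the model, and the smaller the share that rests on the inherited value \hat{V}(s_n).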

In the look-ahead model, the branching factor was reduced by the following heuristics:

- We used a deterministic model, since each stochastic event would require a branch
in the look-ahead tree. Approximations: (i) non-edible ghosts are always chasing
Ms. Pac-Man, (ii) edible ghosts are always fleeing from Ms. Pac-Man, and
(iii) Ms. Pac-Man may stop at positions where it used to perform zig-zagging.

- No agent can turn back to a previously visited position. This turns the state graph
into a simple tree, where backtracking is not possible, reducing the branching factor
further.

- The trajectory is not evaluated further in case of any event that resets the model,
such as the death of the Ms. Pac-Man agent.

Footnote (on fFSM learning): We have bypassed the learning of the fFSM, which is very simple
for Ms. Pac-Man and would be misleading. We believe that pulling out the close-to-deterministic
features is the most relevant piece of the learning problem. This problem is treated elsewhere [17].
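
The enumeration with these heuristics can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the implementation used in the experiments: the function `look_ahead`, the four-node maze, the reward table, and the zero inherited value are all hypothetical, and the model is taken to be fully deterministic with discount factor 1.

```python
from typing import Callable, FrozenSet, Hashable, Iterable, Optional, Tuple

State = Hashable


def look_ahead(
    state: State,
    depth: int,
    successors: Callable[[State], Iterable[State]],
    reward: Callable[[State, State], float],
    is_reset: Callable[[State], bool],
    value: Callable[[State], float],
    visited: FrozenSet[State] = frozenset(),
) -> Tuple[float, Optional[State]]:
    """Depth-first look-ahead; returns (estimated return, best next state)."""
    if is_reset(state):
        return 0.0, None               # e.g. death: the trajectory is not evaluated further
    if depth == 0:
        return value(state), None      # fall back on the inherited value estimate
    visited = visited | {state}
    best_return, best_next = None, None
    for nxt in successors(state):
        if nxt in visited:             # no turning back to a visited position
            continue
        future, _ = look_ahead(nxt, depth - 1, successors, reward, is_reset, value, visited)
        total = reward(state, nxt) + future   # discount factor 1
        if best_return is None or total > best_return:
            best_return, best_next = total, nxt
    if best_next is None:              # dead end: nothing left to expand
        return value(state), None
    return best_return, best_next


# Toy corridor: A-B-C with a dot worth 10 on C, and a ghost waiting on D next to A.
MAZE = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B"], "D": ["A"]}
DOT_REWARD = {"C": 10.0}


def toy_reward(s: State, nxt: State) -> float:
    return DOT_REWARD.get(nxt, 0.0)


def on_ghost(s: State) -> bool:
    return s == "D"


def zero_value(s: State) -> float:
    return 0.0


deep = look_ahead("A", 2, MAZE.__getitem__, toy_reward, on_ghost, zero_value)
shallow = look_ahead("A", 1, MAZE.__getitem__, toy_reward, on_ghost, zero_value)
print(deep, shallow)
```

With depth 2 the search reaches the dot on C and reports a return of 10 via B, whereas with depth 1 it sees no reward at all and has to rely on the inherited value estimate alone.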

Tables 2 and 3 show the results.



Look-ahead     none  1     2     3     4     5     6     7     8     9     45
Average score  6680  6089  4702  4677  4622  6013  6266  6533  6561  6601  7130

Table 2. The effect of look-ahead on the score for 3 lives. Single level game, which is over when
all dots are eaten.



Look-ahead       none  45
Avg. life lost   2.82  1.25
Avg. score/life  2368  5704

Table 3. The effect of look-ahead on the average number of lives lost out of 3 lives. Average
score per life increased by about a factor of 2.4 using look-ahead. Single level game, which is
over when all dots are eaten.



For this illustration we used a single level game. The optimization algorithm did
not make use of the information that a life (out of three) is lost, so Ms. Pac-Man could
make sudden transitions in certain cases. Score per life improved considerably with
look-ahead: it increased from about 2,400 (no look-ahead) to 5,700, i.e., by more than a factor
of 2. Look-ahead thus helped Ms. Pac-Man to escape traps. Overall score improved
from 6680 to 7130 for a relatively high look-ahead value. This is so because the score can
be increased considerably if more ghosts are caught after eating a single power dot,
and deep look-ahead is needed to judge this. There is an interesting interplay between
categorical perception and look-ahead, since a decision surface may work well: the
only relevant variable (apart from the look-ahead estimated value) is the ghost density,
which is a single dimension, so example based learning could work here. The learned
decision surface would thus implicitly measure ghost density, which could be used in
other situations, too.

Look-ahead decreases performance if it is shallow. This is due to the non-Markovian
state value estimation that overrides inherited behavior. The estimated value of eating
the power dot is high due to the zig-zagging behavior that has emerged. However,
look-ahead trusts value estimation, overrides zig-zagging, and that decreases the score.
Deeper look-ahead overcomes this problem.
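
The score-per-life figures follow directly from the reported averages, as the short check below shows. The dictionary names are ours; the numbers are the "none" and "45" entries of Tables 2 and 3.

```python
# Average total score (Table 2) and average number of lives lost (Table 3).
total_score = {"none": 6680, "45": 7130}
lives_lost = {"none": 2.82, "45": 1.25}

# Score per life = average score divided by average number of lives lost.
score_per_life = {k: total_score[k] / lives_lost[k] for k in total_score}
ratio = score_per_life["45"] / score_per_life["none"]

print({k: round(v) for k, v in score_per_life.items()})
print(round(ratio, 2))
```

The quotient reproduces the reported improvement of about a factor of 2.4, while the total score itself rises by less than 7 percent.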

4 Discussion 

We summarize the main features of the architecture, relate them to RL algorithms and 
consider the learning demands. Certain cognitive aspects will also be mentioned. We 
analyze the architecture from the point of view of evolutionary, individual, and social 
learning. 

We emphasize that fFSM is very easy to learn in the Ms. Pac-Man game and it is 
very simple to run. The memory and computer time demand of look-ahead depends on 
the branching ratio and the size of the state space. fFSM, however, works in the space 
of factored variables and may considerably reduce the exponent of the size of the state 
space of RL. 

We have given an example showing that feature extraction is hard: the machine learning
algorithm should figure out high level concepts, like ghost density, to make a correct
decision whether it should eat the power dot or, instead, escape in order to attract
more ghosts to another power dot. Such extraction of low-dimensional observations, or
features, is an unsolved problem in machine learning for the general case. Ghost density
is an illuminating example: we argued that TD error can be used to learn the decision 
surface and the decision will reflect ghost density. The parameter 'ghost density' could 
be used in other scenarios, too. However, in the general case it is unclear how to separate 
this factor from the power dot, or the particular structure of the labyrinth where the 
decision is to be made. 

Modules: modules are spatial-temporal entities. Consider playing volleyball: in this
scenario, there are multiple modules and accompanying activities, such as catching
the ball and running, which involve arm and body modules. Ms. Pac-Man is simple
from this point of view; we did not have to consider problems like the decoupling
of catching the ball and running. Such problems, however, may pose a considerable
challenge for other applications. If arm and body modules can be properly decoupled,
then look-ahead can take advantage of the factored FSM description in both
domains. In the example above, the FSM may be written as a rule: "if Run+ and
CatchTheBall+ then PassTheBall+". An analogous example for Ms. Pac-Man
is "if DotOn+ and IStepOnIt+ then DotOn-".
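
Rules of this form are easy to execute mechanically. The sketch below parses the rule syntax quoted above and applies it to a factored state stored as a dictionary of ON/OFF variables. The helper names `parse_literal`, `parse_rule`, and `step` are hypothetical, and the sketch assumes that all rules are evaluated against the same current state.

```python
from typing import Dict, List, Tuple

Literal = Tuple[str, bool]             # ("DotOn", True) stands for "DotOn+"
RuleT = Tuple[List[Literal], Literal]  # (conditions, effect)


def parse_literal(token: str) -> Literal:
    """Turn 'DotOn+' into ('DotOn', True) and 'DotOn-' into ('DotOn', False)."""
    if len(token) < 2 or token[-1] not in "+-":
        raise ValueError(f"malformed literal: {token!r}")
    return token[:-1], token[-1] == "+"


def parse_rule(text: str) -> RuleT:
    """Parse 'if A+ and B+ then C-' into a list of conditions and one effect."""
    body = text.strip()
    if not body.startswith("if ") or " then " not in body:
        raise ValueError(f"malformed rule: {text!r}")
    cond_part, effect_part = body[3:].split(" then ", 1)
    conditions = [parse_literal(tok.strip()) for tok in cond_part.split(" and ")]
    return conditions, parse_literal(effect_part.strip())


def step(state: Dict[str, bool], rules: List[RuleT]) -> Dict[str, bool]:
    """One deterministic transition: fire every rule whose conditions hold now."""
    new_state = dict(state)
    for conditions, (name, val) in rules:
        if all(state.get(n, False) == v for n, v in conditions):
            new_state[name] = val
    return new_state


dot_rule = parse_rule("if DotOn+ and IStepOnIt+ then DotOn-")
print(step({"DotOn": True, "IStepOnIt": True}, [dot_rule]))
print(step({"DotOn": True, "IStepOnIt": False}, [dot_rule]))
```

Because every factor is updated by its own small rule, the cost of one model step grows with the number of rules rather than with the number of joint configurations.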

In order to illustrate the problems of missing information, we showed two examples
in Fig. 4. In one situation (Fig. 4(a)), Ms. Pac-Man is waiting on the wrong side of the
power dot. In the other case (Fig. 4(b)), Ms. Pac-Man is chasing an edible ghost and is
surrounded by non-edible ones. For the human observer, these are simple cases. But the
situation becomes different for human observers, too, if only the odor of the power dot
and the odor of the ghosts are available. We note that for such poor sensory systems,
temporal depth could hardly help, whereas the TD error would clearly indicate that there
are occasions that largely differ. Additional (e.g., visual) information is needed to build
the fFSM in order to use look-ahead.

The fFSM based description has a number of advantages: the fFSM model can be the
zeroth approximation and can be enriched, e.g., stochastic aspects can be included, if
more information becomes available. Also, the fFSM model offers considerable improvements
at moderate computer time requirements. For example, changes in the labyrinth
can be incorporated, and new FSMs like doors, which can be open or closed, can be included.

An fFSM becomes costly if the branching ratio is large. In this case, sampling methods
can be used, but look-ahead becomes time-consuming, which makes it useless under
time constraints. Furthermore, look-ahead will not work in partially observed worlds
when value estimation may be very poor. Cognitive aspects include certain forms of
autism. In autism, sensing of the 'partner's mind' is sometimes spoiled; emotion detection
or emotion understanding is weak. Then, the lack of information about the
emotional state and the intentions of the partner makes computational costs high, since
the branching ratio becomes high. At the same time, estimation of rewards is spoiled under
such circumstances.

The fFSM description is advantageous in the event-learning formalism described in
[32]. In event-learning, optimization decides about the desired next state, creating a
subproblem in the RL task. This subproblem is given to a backing controller, which tries
to solve it. Beyond the advantage that desires may be factored, event-learning supports
flexible definition of time steps and can utilize backing robust controllers that make use of
crudely learned inverse dynamics; thus it fits the module description we used here.

State description as the list of available modules and choice of action as the choice
of a module was first described in lfT2t . The choice of a single module was extended to
the choice of low-complexity modules, the macros, in [25] by means of slow, evolution-like,
globally optimizing policy search, the CEM.

Low-complexity macros might serve RL since they are factored. If the states defined
by such macros are Markovian then large RL problems can be optimized, since learning
time will scale polynomially if sampling is used [26]. A further advantage appears if
the model, i.e., the transition probability matrix, is learned, since then the
exploration-exploitation dilemma is resolved [29].

With regard to macros, we have found experimentally that (i) the number of optimized
macro states can be very small, but (ii) these states may be non-Markovian.
Non-Markovity can be detected early by measuring the distribution of the TD errors
in each state, which is easy. One may try to optimize each temporally extended macro
state by applying policy optimization during the time the macro is used, followed by
greedification.

Extension of state description by temporal depth is not feasible in general, since 
temporal depth comes to the exponent of the size of the state space. For example, in 
Ms. Pac-Man huge temporal depths are required to improve cumulated rewards (Ta- 
ble ID. On the other hand, fFSM approximations may be easy to learn, look-ahead 
comes cheap for fFSM models and even crude approximations may improve perfor- 
mance. Look-ahead may also improve value estimations, since it works by computing 
immediate rewards and then adds the estimated (and discounted) values of the inherited 
macro states. 

From the point of view of cognition, the inherited CEM optimized policy provides
inherited values for the behavioral states. Look-ahead can decrease the error of these
inherited values in the actual world since it estimates the rewards and discounts the
effect of the inherited policy. Look-ahead thus changes behavior. However, inherited
rules and related behaviors may reappear with aging if working memory degrades, since
look-ahead relies on working memory.

Further work is needed to provide a solid mathematical framework for the look- 
ahead extended approximately deterministic fFSM description (LE-fFSM). LE-fFSM is 
promising since look-ahead can diminish non-Markovity and then factored RL methods 
with attractive scaling properties may apply. However, the feature extraction problem 
that suits LE-fFSM is largely unsolved. Neurally motivated hierarchical exact matrix
completion on spatio-temporal representations IITtI seems appealing in this respect.

4.1 Evolutionary, individual and social learning 

It has been a long-standing problem how to distinguish evolutionary, individual and 
social learning JS). Our architecture offers the following nomenclature: 

Evolutionary learning is a selective globally optimizing method that gives rise to 
low-complexity rules, i.e., macros and behavioral states. The result of evolution- 
ary learning is simple and is easy to pre-wire. 

Individual learning learns TD errors, optimizes modules, e.g., fine tunes the thresholds
of the rules, computes values and TD errors of macro-states, may increase the state
space by decision surfaces, learns factored (stochastic) FSMs, applies look-ahead to
predict the behavior, learns to control the individual fFSMs (e.g., switches them on
and off), and may learn the inverse dynamics and apply a robust controller to improve
control and to solve subtasks.

Social learning requires that the state of other agents including their value estimations 
be sensed, so other agents become fFSMs in the model. The lack of this knowledge, 
however, means that stochastic FSM agents with hidden variables enter the opti- 
mization problem. Such tasks are hard, value estimation may become highly im- 
precise, and may prohibit collaboration and social learning, especially since other 
agents are not stationary, they learn lITSl . 

4.2 The concept of state in MDPs 

Close-to-deterministic factored finite-state machine description seems attractive for
MDPs, since Laplace's demon is a good initial approximation of the world for different
reasons 11351221 and fFSMs are tractable from the point of view of problem size
in reinforcement learning. However, the extraction of close-to-deterministic variables
(features) is an unsolved problem. Low-complexity, temporally extended and
close-to-deterministic factors can satisfy the concept of states in MDPs to a large extent. The
environment as well as the agent's action may switch the factors ON and OFF, which
would correspond to state-action-state transitions. Controlled optimal switching is then
the task of MDPs, which is the concept of event-learning [32].

5 Conclusions and outlook

We considered spatio-temporal modules in this paper. We showed how to make the transition
from the modular description to state description and how to find out if states are
Markovian or not. We treated different options for achieving the Markovian property 
and suggested that the learning of factored FSM is the key since it suits look-ahead 
based improvement of value estimation. We also argued that learning of the factors in 
the Ms. Pac-Man problem happens to be easy and noted that the general problem is un- 
solved in machine learning. We introduced look-ahead into the architecture by means 
of a deterministic and approximate fFSM, and conjectured that factored RL models 
with fFSM look-ahead have many desirable features, including close to Markovian de- 
scription, polynomial scaling and diminishing the exploration-exploitation dilemma by 
model learning. 

We sketched some cognitive aspects of the model: one of the origins of autistic
behavior, and changes of behavior if the capacity of working memory changes.

We considered the architecture from the point of view of evolutionary, individual 
and social learning and tried to separate key aspects of this long-standing problem. 
We have shown that CEM may give rise to simple low-complexity policies, which are 
easy to inherit. A particular property of individual learning is to construct fFSM, use it 
for look-ahead in order to diminish errors of value estimation and the non-Markovian 
characteristics of inherited macro states. Social learning, in our framework, depends on 
sensory information; if values or TD errors of the partners can be sensed then fFSM 
description may be sufficient and collaborative planning may be eased considerably.